Analyzing the Relationship Between Billboard Hot 100 Placement and Song Qualities

By: Tien Doan, Iris Hu, Shannen Lam

Introduction

Every December, Spotify releases "Spotify Wrapped," a summary of your most played songs, accompanied by a playlist filled with those songs. This year, Spotify added on "Your Decade Wrapped," a series of playlists summarizing a user's top songs of the 2010s. Earlier in November, Billboard also released their "100 Songs That Defined the 2010s."

This made our group curious about song trends over time. Do the songs that become popular have anything in common? Are louder songs more successful? Have artists perfected some kind of formula to guarantee their songs' success?

So, using Spotify's "Get Audio Features for a Track" API, we analyzed the traits of songs from the Billboard Hot 100 charts over the course of about a year (Jan 2019 to Nov 2019) in order to see if there really is an overarching trend that popular songs follow.

Table of Contents

  1. Getting Started
    1. Required Libraries
  2. Data Collection & Preprocessing
    1. Getting the Data into a Dataframe
    2. Remove Duplicates and Count Occurrences
  3. Exploratory Data Analysis & Visualization
    1. Song Trait Trends
      1. Tempo
      2. Valence
      3. Loudness
      4. Energy
      5. Key
      6. Song Trait Conclusions
    2. Trends in the Top 50 Songs
      1. Tempo
      2. Valence
      3. Loudness
      4. Energy
      5. Key
      6. Top 50 Conclusions
    3. Grouping by Artists
  4. Insights and Conclusions
  5. Future Exploration

1. Getting Started

1.1 Required Libraries

We will need the following:

  • Spotipy: To obtain Spotify API data
  • Billboard.py: To obtain Billboard Hot 100 chart data
  • Pandas: To manipulate and analyze data
  • Datetime: To create dates
  • Time: To use sleep between API calls
  • Numpy: To help arrange bar graph ticks
  • Matplotlib.pyplot: To make non-interactive graphs
  • Seaborn: To customize graphs and make violin plots
  • Plotly.express: To make interactive graphs
  • Sklearn: To perform linear regression analysis
In [11]:
import pandas as pd
import billboard
import datetime
import seaborn as sns
import time

import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import plotly.express as px
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

2. Data Collection & Preprocessing

2.1 Getting the Data into a Dataframe

The first two steps in the data life cycle are data collection and data preprocessing. Since we need both Spotify audio features and Billboard chart data, we focus on finding APIs and libraries to retrieve that information.

Below, we use the convenient billboard.py library to retrieve Billboard Hot 100 chart data for the past year. Unfortunately, requesting the entire year's data in one go gets us rate limited, so we break it up into two-month chunks with some sleeping in between to give the API a break.

We store each week's chart into a larger charts array for later processing.

In [0]:
charts = []

year = 2019
one_week = datetime.timedelta(weeks=1)
# Month boundaries of the two-month chunks: 1-1, 3-1, 5-1, 7-1, 9-1, 11-1
chunk_months = [1, 3, 5, 7, 9, 11]

next_week = datetime.date(year, 1, 1)
for start_month, end_month in zip(chunk_months, chunk_months[1:]):
  end_date = datetime.date(year, end_month, 1)
  # Fetch one chart per week until we reach the end of this chunk
  while next_week < end_date:
    date_str = next_week.strftime("%Y-%m-%d")
    chart = billboard.ChartData('hot-100', date=date_str)
    charts.append(chart)
    next_week += one_week

  print('Finished {}-1 through {}-1'.format(start_month, end_month))
  # Give the API a break between chunks (no need after the last one)
  if end_month != chunk_months[-1]:
    time.sleep(100)
Finished 1-1 through 3-1
Finished 3-1 through 5-1
Finished 5-1 through 7-1
Finished 7-1 through 9-1
Finished 9-1 through 11-1

This is how many weeks of chart data we have!

In [0]:
print(len(charts))
44
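
As a quick sanity check on that number, here is a minimal sketch (not part of our original notebook) that generates the same weekly dates without calling the Billboard API. The helper name weekly_dates is made up for illustration.

```python
import datetime

def weekly_dates(start, end):
    """Yield dates one week apart, from start up to (not including) end."""
    current = start
    while current < end:
        yield current
        current += datetime.timedelta(weeks=1)

# Same range as the chart download: Jan 1, 2019 up to Nov 1, 2019
dates = list(weekly_dates(datetime.date(2019, 1, 1), datetime.date(2019, 11, 1)))
print(len(dates))            # number of weekly charts requested
print(dates[0], dates[-1])   # first and last chart dates
```

The count matches the 44 charts we collected, and the last date falls in the final week of October.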

Now that we have all the Billboard chart data that we need, it's time to grab the audio features for all of the songs in those charts.

Below, we import a nice Python library for the Spotify Web API: Spotipy. In order to retrieve audio features for a track, we first need a track URI, URL, or ID. Unfortunately, the Billboard chart data does not give us that information, so we're going to have to get it ourselves.

The Spotify Web API provides an endpoint called search that lets you query their database for track URIs. The only thing is, the query has to be formatted a specific way: "artist:[artist_name] track:[song_title]", so we have to create those query strings. Later on, we'll also need the song title, artist, as well as its rank on the Hot 100 chart, so we'll also keep track of those in a tuple.

In [0]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

client_id = %env CLIENT_ID
client_sec = %env CLIENT_SECRET
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_sec)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
In [0]:
hot_100_tracks = []
for chart in charts:
  for song in chart:
    hot_100_tracks.append(('artist:' + song.artist.split()[0] + ' track:' + song.title, song.title, song.artist, song.rank))
print(len(hot_100_tracks))
4400

So we have our query strings formatted correctly for all the songs! You may have noticed that we only took the first word in the artist's name. In the next step, actually searching for the track, we found that a lot of songs from the Billboard charts had artist names that did not match the format that Spotify used (e.g. featured-artist credits like "Chris Brown Featuring Drake"), so a lot of songs failed to find a match. Song title formatting rarely varies between platforms, so by keeping the title and chopping off the rest of the artist's name, we matched a lot more songs to their Spotify track URIs. (As you can see below, we only had 71 failed matches out of 4400 total searches.)
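
To make the query format concrete, here is a small sketch of the string-building step on its own. build_query is a hypothetical helper name; the logic mirrors the expression used in the loop above, and the example song comes from our chart data.

```python
def build_query(artist, title):
    """Build a Spotify search query from a Billboard artist credit and title.

    Only the first word of the artist credit is kept, so credits such as
    "X Featuring Y" do not prevent a match.
    """
    first_word = artist.split()[0]
    return 'artist:' + first_word + ' track:' + title

query = build_query('Meek Mill Featuring Drake', 'Going Bad')
print(query)
```

The trade-off is that a one-word artist filter is looser, so the top search result could occasionally be a different song with the same title.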

The Spotify Web API is also rate limited, so after every 100 queries (the number we could make before being rate limited), we sleep for 10 seconds before searching for the next 100 songs.

In [0]:
hot_100_track_uris = []
failed_search = []
number_queries = 0
for track, song, artist, rank in hot_100_tracks:
  track_id = sp.search(q=track, type='track', limit=1)
  if(len(track_id['tracks']['items']) > 0):
    hot_100_track_uris.append((track_id['tracks']['items'][0]['uri'], song, artist, rank))
  else:
    failed_search.append(track)  # keep the query string so failed searches can be inspected
  number_queries += 1
  if number_queries % 100 == 0:
    time.sleep(10)

print(failed_search)
print('Completed!')
In [0]:
print(len(failed_search))
print(len(hot_100_track_uris))
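
The sleep-after-every-100-queries pattern used above can be factored into a small helper. This is only a sketch: throttled_map and its parameters are hypothetical names, and the lambda stands in for a real API call such as sp.search.

```python
import time

def throttled_map(func, items, batch_size=100, pause_seconds=10, sleep=time.sleep):
    """Apply func to each item, pausing after every batch_size calls."""
    results = []
    for count, item in enumerate(items, start=1):
        results.append(func(item))
        if count % batch_size == 0:
            sleep(pause_seconds)
    return results

# Demo: a stand-in "API call" and a fake sleep that just records each pause
pauses = []
squares = throttled_map(lambda n: n * n, range(250), sleep=pauses.append)
print(len(squares), pauses)
```

Passing the sleep function in as a parameter makes it possible to try the loop out without actually waiting.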

Finally, we are ready to query for what we were after from the beginning: audio features. Each audio feature query returns an Audio Features object, which we store away for later use. Again, we must sleep in order to not spam the API.

In [0]:
hot_100_audio_features = []
num_queries = 0
for uri, song, artist, rank in hot_100_track_uris:
  af = sp.audio_features(tracks=[uri])[0]
  hot_100_audio_features.append((af, song, artist, rank))
  num_queries += 1
  if num_queries % 100 == 0:
    time.sleep(10)

print('Completed!')
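
One possible way to cut down on the number of requests: Spotipy's audio_features takes a list of tracks, so the URIs could be sent in batches instead of one at a time. We are assuming the limit of 100 IDs per request described in Spotify's documentation for this endpoint, so check the current docs before relying on it. The chunked helper and the example URIs below are made up for illustration.

```python
def chunked(items, size):
    """Yield successive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical placeholder URIs, just to show the batch sizes
uris = ['spotify:track:example{}'.format(i) for i in range(250)]
batches = list(chunked(uris, 100))
print([len(batch) for batch in batches])

# With a real client, each batch could then be passed along as
# sp.audio_features(tracks=batch), sleeping between batches as before.
```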

We now have audio features and Billboard Hot 100 chart positions for every song we could match. It's finally time to throw it all into a dataframe for analysis and visualization!

The audio features we chose to look closer at are: tempo, valence, loudness, energy, and key. We believe those were the most important features among the full suite of feature information we got: {duration_ms, key, mode, time_signature, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence, tempo}.

In [0]:
tracks = []
for af, song, artist, rank in hot_100_audio_features:
  tracks.append((song, artist, af['tempo'], af['valence'], af['loudness'], af['energy'], af['key'], rank))
hot_100_df = pd.DataFrame(tracks, columns=['Song', 'Artist', 'Tempo', 'Valence', 'Loudness', 'Energy', 'Key', 'Rank'])
hot_100_df
In [4]:
hot_100_df = pd.read_csv('hot_100.csv')

2.2 Remove Duplicates and Count Occurrences

Now that we have all the data, let's clean this up a bit more.

How many of these 4329 songs are actually unique songs? Songs can stay on the Billboard Hot 100 for several weeks!

In [5]:
print(hot_100_df.Song.nunique())
519

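There are only 519 unique song titles. (The deduplicated table below has 524 rows because we deduplicate on both title and artist, and the same title can appear under more than one artist credit.) Cleaning this data can help speed up our analysis.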

For each duplicate song, we'll keep its highest ranking on the Hot 100 and create a "Count" column, keeping track of how many total weeks it has been on the Hot 100 since the beginning of 2019 (consecutive-ness of the weeks doesn't matter to us).

In [6]:
# Sorts all the songs by rank, then removes duplicates 
unique_songs_df = hot_100_df.sort_values('Rank', ascending=True).drop_duplicates(['Song','Artist'])

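# Iterate over all the song titles and record how many times each occurs
# in the Hot 100 from Jan 2019 - Nov 2019.
# Note: counts are keyed on the title only, so songs sharing a title share a count.
# (.items() replaces .iteritems(), which was removed in pandas 2.0)
for song_name, count in hot_100_df.Song.value_counts().items():
    unique_songs_df.loc[unique_songs_df['Song'] == song_name, 'Count'] = count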

unique_songs_df.sort_values('Count', ascending=False)
Out[6]:
Unnamed: 0 Song Artist Tempo Valence Loudness Energy Key Rank Count
1289 1289 Wow. Post Malone 99.960 0.3880 -7.359 0.539 11 2 44.0
195 195 Sunflower (Spider-Man: Into The Spider-Verse) Post Malone & Swae Lee 89.911 0.9130 -5.574 0.479 2 1 44.0
1970 1970 Dancing With A Stranger Sam Smith & Normani 102.998 0.3470 -7.513 0.520 8 7 41.0
96 96 Without Me Halsey 136.041 0.5330 -7.050 0.488 6 1 41.0
2359 2359 Talk Khalid 135.984 0.3460 -8.575 0.400 0 3 37.0
1198 1198 Going Bad Meek Mill Featuring Drake 86.003 0.5440 -6.365 0.496 4 10 34.0
2070 2070 Sweet But Psycho Ava Max 133.002 0.6280 -4.724 0.704 1 10 34.0
593 593 Happier Marshmello & Bastille 100.015 0.6710 -2.749 0.792 5 2 34.0
2159 2159 Old Town Road Lil Nas X Featuring Billy Ray Cyrus 136.041 0.6390 -5.560 0.619 6 1 34.0
1491 1491 Better Khalid 97.949 0.1120 -10.278 0.552 0 8 34.0
990 990 Sucker Jonas Brothers 137.958 0.9520 -5.065 0.734 1 1 34.0
393 393 7 Rings Ariana Grande 140.048 0.3270 -10.732 0.317 1 1 33.0
892 892 Shallow Lady Gaga & Bradley Cooper 95.799 0.3230 -6.362 0.385 7 1 33.0
296 296 Sicko Mode Travis Scott 155.008 0.4460 -3.714 0.730 8 3 32.0
297 297 High Hopes Panic! At The Disco 82.014 0.6810 -2.729 0.904 5 4 32.0
3252 3252 Bad Guy Billie Eilish 135.128 0.5620 -10.965 0.425 7 1 30.0
2563 2563 Suge DaBaby 75.445 0.8440 -6.482 0.662 2 7 30.0
203 203 Eastside benny blanco, Halsey & Khalid 89.391 0.3190 -7.648 0.680 6 9 29.0
1894 1894 Look Back At It A Boogie Wit da Hoodie 96.057 0.5360 -5.075 0.587 3 27 29.0
1955 1955 Middle Child PnB Rock & XXXTENTACION 151.895 0.3980 -6.400 0.567 0 91 28.0
496 496 Middle Child J. Cole 123.984 0.4630 -11.713 0.364 8 4 28.0
2175 2175 Whiskey Glasses Morgan Wallen 149.959 0.7070 -4.580 0.680 6 17 27.0
2367 2367 Pop Out Polo G Featuring Lil Tjay 168.112 0.2610 -7.119 0.639 1 11 27.0
1707 1707 Envy Me Calboy 149.042 0.5840 -7.664 0.488 1 31 26.0
3209 3209 Talk You Out Of It Florida Georgia Line 119.964 0.5580 -4.791 0.708 4 57 26.0
1504 1504 Beautiful Crazy Luke Combs 103.313 0.3820 -7.431 0.402 11 21 25.0
2180 2180 Con Calma Daddy Yankee & Katy Perry Featuring Snow 93.989 0.6560 -2.652 0.860 8 22 25.0
119 119 Speechless Dan + Shay 135.929 0.3860 -5.968 0.438 1 24 25.0
1699 1699 Pure Water Mustard & Migos 202.015 0.1370 -5.545 0.559 0 23 25.0
3200 3200 Worth It YK Osiris 123.936 0.4220 -4.082 0.528 5 48 25.0
... ... ... ... ... ... ... ... ... ... ...
2536 2536 Sanctuary Joji 167.788 0.3160 -7.199 0.650 1 80 1.0
9 9 A Holly Jolly Christmas Burl Ives 140.467 0.8880 -13.056 0.375 0 10 1.0
3629 3629 Out Of Luck Lil Tecca 152.068 0.3370 -8.271 0.624 8 80 1.0
4020 4020 Chicken Noodle Soup j-hope Featuring Becky G. 97.053 0.1680 -4.081 0.817 2 81 1.0
3332 3332 Small Talk Katy Perry 115.999 0.6020 -5.547 0.631 5 81 1.0
3729 3729 homecoming queen? Kelsea Ballerini 114.014 0.2900 -5.443 0.512 9 82 1.0
475 475 F&N Future 160.940 0.3930 -4.766 0.628 1 83 1.0
3524 3524 Afterglow Taylor Swift 111.011 0.3990 -8.746 0.449 9 75 1.0
2829 2829 Costa Rica Dreamville Featuring Bas, JID, Guapdad 4000, R... 120.749 0.6640 -6.312 0.647 7 75 1.0
965 965 Wit It Gunna 125.905 0.5080 -5.318 0.657 7 75 1.0
3420 3420 Sup Mate Young Thug Featuring Future 147.043 0.5210 -7.951 0.552 1 70 1.0
3516 3516 Death By A Thousand Cuts Taylor Swift 94.071 0.3130 -6.754 0.732 4 67 1.0
1839 1839 Floating ScHoolboy Q Featuring 21 Savage 139.932 0.5430 -4.977 0.545 11 67 1.0
18 18 Let It Snow, Let It Snow, Let It Snow Dean Martin 134.005 0.7010 -14.014 0.240 1 20 1.0
3219 3219 Dreams Money Can Buy Drake 180.331 0.3300 -6.635 0.587 6 68 1.0
4105 4105 Stretch You Out Summer Walker Featuring A Boogie Wit da Hoodie 103.359 0.2520 -6.757 0.519 1 68 1.0
3618 3618 Babushka Boi A$AP Rocky 134.979 0.9050 -5.446 0.743 10 69 1.0
4009 4009 XXL DaBaby 143.975 0.7050 -5.333 0.655 6 69 1.0
65 65 Break Da Law 21 Savage 141.039 0.3590 -7.823 0.575 1 70 1.0
2129 2129 New Magic Wand Tyler, The Creator 139.566 0.4640 -5.414 0.730 5 70 1.0
1064 1064 Whip 2 Chainz Featuring Travis Scott 121.949 0.0379 -7.428 0.399 7 75 1.0
960 960 Outstanding Gunna 149.075 0.3240 -4.554 0.730 6 70 1.0
14 14 Rudolph The Red-Nosed Reindeer Gene Autry 142.110 0.6440 -14.056 0.159 8 16 1.0
1359 1359 Price On My Head NAV Featuring The Weeknd 74.949 0.2780 -6.104 0.574 6 72 1.0
962 962 Be Like Me Lil Pump Featuring Lil Wayne 104.006 0.5930 -8.080 0.396 5 72 1.0
4246 4246 Lose You To Love Me Selena Gomez 101.993 0.0916 -9.005 0.340 4 15 1.0
4013 4013 PROLLY HEARD DaBaby 124.052 0.9290 -2.388 0.783 1 73 1.0
964 964 3 Headed Snake Gunna Featuring Young Thug 135.030 0.6460 -10.708 0.415 6 74 1.0
2133 2133 A Boy Is A Gun Tyler, The Creator 79.568 0.5050 -8.302 0.689 2 74 1.0
1580 1580 Victory Lap Nipsey Hussle Featuring Stacy Barthe 89.146 0.0467 -2.209 0.755 4 100 1.0

524 rows × 10 columns
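
One thing to notice in the table: both "Middle Child" rows show a Count of 28, because the loop above counts by song title alone. A possible alternative, sketched here on a tiny made-up chart history rather than our real data, is to group on title and artist together so that each song gets its own best rank and week count.

```python
import pandas as pd

# Made-up weekly chart entries: two different songs share a title
weekly = pd.DataFrame({
    'Song':   ['Middle Child', 'Middle Child', 'Middle Child', 'Talk'],
    'Artist': ['J. Cole', 'J. Cole', 'PnB Rock & XXXTENTACION', 'Khalid'],
    'Rank':   [4, 6, 91, 3],
})

# Best (lowest) rank and number of chart weeks per (title, artist) pair
summary = (weekly.groupby(['Song', 'Artist'])
                 .agg(Rank=('Rank', 'min'), Count=('Rank', 'size'))
                 .reset_index())
print(summary)
```

Applied to hot_100_df, this would change some of the counts shown above, so we leave our original results as they are and only note the caveat.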

3. Exploratory Data Analysis & Visualization

Now that all the data has been collected and tidied up a bit, we want to see if certain song qualities affect a song's Hot 100 ranking. We decided to look at the following traits: tempo, valence, loudness, energy, and (musical) key. We chose these because we felt like popular music has gotten noisier and faster over the years (the rise of EDM and rave cultures) and wanted to see if that was actually true.

3.1 Song Trait Trends

3.1.1 Tempo

First, we'll look at tempo and whether a song's tempo is related to its Hot 100 rank.

Tempo is measured in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. In other words, the higher the BPM, the faster the song is.

Let's look at a scatterplot and see if we can observe anything.

In [7]:
# Sets size for the rest of our figures (i.e. graphs)
sns.set(rc={'figure.figsize':(25,10)})

tempo_rank_plot = unique_songs_df.plot.scatter(x='Rank', y='Tempo', marker='.', c='black')
tempo_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs Tempo (Jan 2019 -- Nov 2019)\n')
tempo_rank_plot.set_xlabel('Rank')
tempo_rank_plot.set_ylabel('Tempo (BPM)')
plt.show()

This graph... looks like a mess. It doesn't give me hope that the others will look significantly nicer. Let's graph it (and the following ones) as a violin plot and see if that helps.

In [8]:
sns.violinplot(x=unique_songs_df['Rank'], y=unique_songs_df['Tempo'])
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa76252fc88>

Okay... Looks mildly better. At least this way, we can tell that most of the violins are unimodal. We should note that the violins that look like a line indicate that the given rank only has songs of one tempo. This applies to future graphs as well, but with the corresponding trait.

Here are some observations we have:

  • Most of the violins are unimodal
  • But there's no trend in them becoming more unimodal as the ranking goes up
  • There's some serious skewing in the data
  • Most of the violins are not symmetric around the center
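
One variation we did not try, but which might make the violins easier to read, is to group ranks into bins of ten so that each violin summarizes more songs. A rough sketch with pandas' cut function, using a handful of made-up ranks:

```python
import pandas as pd

# Made-up ranks, just to show which bin each one lands in
ranks = pd.Series([1, 10, 11, 50, 91, 100])

# Bin edges 0, 10, ..., 100 give the intervals (0, 10], (10, 20], ...
edges = list(range(0, 101, 10))
labels = ['{}-{}'.format(low + 1, low + 10) for low in range(0, 100, 10)]
rank_bins = pd.cut(ranks, bins=edges, labels=labels)
print(rank_bins.tolist())
```

The resulting column could then be used as the x value in sns.violinplot in place of the raw rank.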

So, let's apply a linear regression to see if there's any strong trend to be found. We used sklearn to make this model. You can read more about it here.

In [12]:
%matplotlib inline

# Reshape the Rank and Tempo to be used in the linear regression
X = unique_songs_df['Rank'].values.reshape(-1,1)
y = unique_songs_df['Tempo'].values.reshape(-1,1)

# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)

# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Tempo (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Tempo (BPM)')
plt.show()

#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Regression model intercept:
[116.68783729]
Regression model slope:
[[0.11431939]]

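The fitted slope is about 0.11 BPM per rank position, so the predicted tempo differs by only about 11 BPM between rank 1 and rank 100, which is small next to the spread of tempos in the scatterplot. We can say that there is almost no linear relationship between the tempo and rank of a song.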
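
A caveat worth spelling out: the size of a slope depends on the units of the trait, so a correlation coefficient is a more direct measure of how strong a relationship is. The sketch below uses synthetic data (not our real songs) to show that converting tempo from beats per minute to beats per second shrinks the slope by a factor of 60 while leaving Pearson's r unchanged. The numbers used to generate the fake tempos are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for our data: ranks 1-100 with noisy tempos in BPM
rank = np.arange(1, 101)
tempo_bpm = 117 + 0.11 * rank + rng.normal(0, 28, size=rank.size)
tempo_bps = tempo_bpm / 60  # the same tempos in beats per second

# Least-squares slope of tempo against rank, in each unit
slope_bpm, _ = np.polyfit(rank, tempo_bpm, 1)
slope_bps, _ = np.polyfit(rank, tempo_bps, 1)

# Pearson correlation coefficient, in each unit
r_bpm = np.corrcoef(rank, tempo_bpm)[0, 1]
r_bps = np.corrcoef(rank, tempo_bps)[0, 1]

print('slope per rank (BPM):', slope_bpm)
print('slope per rank (beats/sec):', slope_bps)
print('r (BPM):', r_bpm, ' r (beats/sec):', r_bps)
```

On the real data, np.corrcoef(unique_songs_df['Rank'], unique_songs_df['Tempo'])[0, 1] would give the corresponding r.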

Maybe there's a most popular tempo? Let's graph the usage of tempos and see! Here, we're using plotly bar graphs to make the graphs interactive so we can actually read the exact data values.

In [13]:
# Group the data by tempo
group_tempo = unique_songs_df.groupby('Tempo')

# Will hold tempo to the number of songs with that tempo
tempo_count_dict = {}

# For every tempo group, count the number of songs with
# that tempo, round the tempo float to the nearest integer,
# then add the count to tempo_count_dict
for name, group in group_tempo:
    if int(round(name)) in tempo_count_dict:
        tempo_count_dict[int(round(name))] = tempo_count_dict[int(round(name))] + len(group.index)
    else:
        tempo_count_dict[int(round(name))] = len(group.index)

# Create a bar graph of tempos and how often they're used in the songs
df = pd.DataFrame(list(zip(tempo_count_dict.keys(), tempo_count_dict.values())), columns =['Tempo', 'Count'])
# labels maps column names (not 'x'/'y') to the axis titles that are displayed
fig = px.bar(df, x='Tempo', y='Count', labels={'Tempo':'Tempo (BPM)'})
fig.show()

While faster songs are not more popular, some tempos are clearly more common than others in the Hot 100. The most used seem to be 95, 100, 120, 130, 140, 150, and 160 BPM. They are all multiples of 5, which is interesting to note.

It seems like most composers whose songs have made it onto the Hot 100 avoid in-between tempos. Also, visually, it looks like the most common tempos are around 90-110 BPM and 140-160 BPM. Few popular songs are faster than 180 BPM or slower than 75 BPM.
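
The rounding-and-counting step behind that bar graph can also be written more compactly with pandas. A minimal sketch, using six tempo values taken from the table in section 2.2:

```python
import pandas as pd

# Tempos (BPM) of six songs from our table
tempos = pd.Series([99.960, 100.015, 89.911, 136.041, 135.984, 135.929])

# Round to the nearest whole BPM, then count how many songs share each value
tempo_counts = tempos.round().astype(int).value_counts().sort_index()
print(tempo_counts)
```

Here value_counts does the work of the dictionary, and sort_index puts the tempos in ascending order for plotting.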

3.1.2 Valence

Next, let's look at whether a song's valence is related to its Hot 100 rank.

According to Spotify:

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Let's see if people like happy or sad music more. First, the scatterplot and the violin plot!

In [14]:
# Sets size for the rest of our figures (i.e. graphs)
sns.set(rc={'figure.figsize':(25,10)})

valence_rank_plot = unique_songs_df.plot.scatter(x='Rank', y='Valence', marker='.', c='black')
valence_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Valence (Jan 2019 -- Nov 2019)\n')
valence_rank_plot.set_xlabel('Rank')
valence_rank_plot.set_ylabel('Valence')
plt.show()
In [15]:
sns.violinplot(x=unique_songs_df['Rank'], y=unique_songs_df['Valence'])
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa75c83e518>

Again, nothing in particular stands out here. Let's try linear regression to see if that shows us anything.

In [16]:
# Reshape the Rank and Valence to be used in the linear regression
X = unique_songs_df['Rank'].values.reshape(-1,1)
y = unique_songs_df['Valence'].values.reshape(-1,1)

# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)

# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Valence (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Valence')
plt.show()

#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Regression model intercept:
[0.51924336]
Regression model slope:
[[-0.00076861]]

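This regression slope of -0.00077 per rank position amounts to a predicted change of less than 0.08 in valence (on a 0.0 to 1.0 scale) across the whole chart, so there is almost no linear relationship between the happiness of a song and its popularity. Note that this slope is in different units from the tempo slope, so the two numbers can't be compared directly.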

Now, we should look at the frequency distribution of the valences.

In [17]:
# Group the data by valence
group_valence = unique_songs_df.groupby('Valence')

# Will hold valence to the number of songs with that valence
valence_count_dict = {}

# For every valence group, count the number of songs with
# that valence, round the valence float to the nearest tenth,
# then add the count to valence_count_dict
for name, group in group_valence:
    if round(name, 1) in valence_count_dict:
        valence_count_dict[round(name, 1)] = valence_count_dict[round(name, 1)] + len(group.index)
    else:
        valence_count_dict[round(name, 1)] = len(group.index)

# Create a bar graph of valences and how often they're used in the songs
df = pd.DataFrame(list(zip(valence_count_dict.keys(), valence_count_dict.values())), columns =['Valence', 'Count'])
fig = px.bar(df, x='Valence', y='Count')
fig.show()

This valence distribution doesn't really tell us much. This is because the distribution of valences that Spotify provides for songs in general looks like the following: Image of Spotify Valence

Our valence bar graph seems to reflect that distribution, having similar peaks at 0.4 and 0.6, so we can't draw any conclusions from this.
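
A side note on the binning: rounding to the nearest tenth makes the 0.0 and 1.0 bars cover only half the range of the other bars (0.0 to 0.05 and 0.95 to 1.0). An alternative we could have used is equal-width bins via numpy's histogram function. A sketch on the first eight valence values from the table in section 2.2:

```python
import numpy as np

# Valence of the first eight songs in our table
valences = np.array([0.3880, 0.9130, 0.3470, 0.5330, 0.3460, 0.5440, 0.6280, 0.6710])

# Ten equal-width bins: [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]
edges = np.linspace(0.0, 1.0, 11)
valence_counts, _ = np.histogram(valences, bins=edges)
print(valence_counts.tolist())
```

Each entry is the number of songs in the corresponding bin, so the list can be plotted directly against the bin edges.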

3.1.3 Loudness

Maybe loudness affects the popularity of a song?

Loudness is measured in decibels (dB). Spotify says that "Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude)." The values typically range between -60 and 0 dB.

In [18]:
# Sets size for the rest of our figures (i.e. graphs)
sns.set(rc={'figure.figsize':(25,10)})

loudness_rank_plot = unique_songs_df.plot.scatter(x='Rank', y='Loudness', marker='.', c='black')
loudness_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Loudness (Jan 2019 -- Nov 2019)\n')
loudness_rank_plot.set_xlabel('Rank')
loudness_rank_plot.set_ylabel('Loudness (dB)')
plt.show()
In [19]:
sns.violinplot(x=unique_songs_df['Rank'], y=unique_songs_df['Loudness'])
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa7568e14a8>

We can observe that most songs don't drop under -20 dB. However, like before, we're not really seeing anything in particular standing out. There is a lot of skewing happening in the violin plots, though.

This doesn't have me hopeful for the regression line, but let's run a linear regression analysis anyway just to see.

In [20]:
# Reshape the Rank and Loudness to be used in the linear regression
X = unique_songs_df['Rank'].values.reshape(-1,1)
y = unique_songs_df['Loudness'].values.reshape(-1,1)

# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)

# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Loudness (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Loudness')
plt.show()

#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Regression model intercept:
[-6.43998515]
Regression model slope:
[[0.00121108]]

Yep, the linear regression model tells us nothing much. A slope of 0.001 dB per rank position is a predicted difference of only about 0.1 dB across the whole chart, so there is almost no linear relationship here.

How about the frequency bar graph?

In [21]:
# Group the data by loudness
group_loudness = unique_songs_df.groupby('Loudness')

# Will hold loudness to the number of songs with that loudness
loudness_count_dict = {}

# For every loudness group, count the number of songs with
# that loudness, round the loudness float to the nearest tenth,
# then add the count to loudness_count_dict
for name, group in group_loudness:
    if round(name, 1) in loudness_count_dict:
        loudness_count_dict[round(name, 1)] = loudness_count_dict[round(name, 1)] + len(group.index)
    else:
        loudness_count_dict[round(name, 1)] = len(group.index)

# Create a bar graph of loudness values and how often they occur in the songs
df = pd.DataFrame(list(zip(loudness_count_dict.keys(), loudness_count_dict.values())), columns =['Loudness', 'Count'])
# labels maps column names (not 'x'/'y') to the axis titles that are displayed
fig = px.bar(df, x='Loudness', y='Count', labels={'Loudness':'Loudness (dB)'})
fig.show()

This is what the given distribution of loudness for songs provided by Spotify looks like: Image of Spotify Loudness

In Spotify's distribution, there is a single peak in the -10 to -5 dB range that tails off just before -10 dB. In our distribution, there seem to be two peaks: one in the -5.8 to -4.5 dB range and a smaller one in the -7.2 to -6.3 dB range. Both sit inside Spotify's peak, so our distribution roughly reflects Spotify's. So, like the valence bar graph, no conclusions can be drawn from this.

3.1.4 Energy

Is a song's energy related to its Hot 100 rank?

Spotify describes energy as follows:

Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

The higher the number, the more "energetic" the song is.

In [22]:
# Sets size for the rest of our figures (i.e. graphs)
sns.set(rc={'figure.figsize':(25,10)})

energy_rank_plot = unique_songs_df.plot.scatter(x='Rank', y='Energy', marker='.', c='black')
energy_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Energy (Jan 2019 -- Nov 2019)\n')
energy_rank_plot.set_xlabel('Rank')
energy_rank_plot.set_ylabel('Energy')
plt.show()
In [23]:
sns.violinplot(x=unique_songs_df['Rank'], y=unique_songs_df['Energy'])
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa75637bb70>

Again, consistent with the traits above, the scatter and violin plots tell us nothing useful. Let's look at the linear regression.

In [24]:
# Reshape the Rank and Energy to be used in the linear regression
X = unique_songs_df['Rank'].values.reshape(-1,1)
y = unique_songs_df['Energy'].values.reshape(-1,1)

# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)

# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Energy (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Energy')
plt.show()

#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Regression model intercept:
[0.61902655]
Regression model slope:
[[6.61601555e-05]]

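This is our flattest regression line yet. The slope is 6.6e-05 per rank position, which works out to a predicted change of less than 0.01 in energy (on a 0.0 to 1.0 scale) across all 100 ranks. It is so incredibly close to 0 that it's safe to say that ranking does not directly correlate with the energy of a song.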

Maybe the frequency distribution will show us something.

In [25]:
# Group the data by energy
group_energy = unique_songs_df.groupby('Energy')

# Will hold energy to the number of songs with that energy
energy_count_dict = {}

# For every energy group, count the number of songs with
# that energy, round the energy float to the nearest tenth,
# then add the count to energy_count_dict
for name, group in group_energy:
    if round(name, 1) in energy_count_dict:
        energy_count_dict[round(name, 1)] = energy_count_dict[round(name, 1)] + len(group.index)
    else:
        energy_count_dict[round(name, 1)] = len(group.index)

# Create a bar graph of energy values and how often they occur in the songs
df = pd.DataFrame(list(zip(energy_count_dict.keys(), energy_count_dict.values())), columns =['Energy', 'Count'])
fig = px.bar(df, x='Energy', y='Count')
fig.show()

This is the Spotify distribution of energies in songs: Image of Spotify Energy

Comparing our distribution to the Spotify one, there are some noticeable differences. Our distribution peaks around 0.6 and 0.7 rather than around 0.8. This could indicate that popular songs tend to have middle to middle-high energy. Also, there are fewer songs in the 0.8 to 1.0 range in our distribution than in the Spotify one, perhaps indicating that songs that are too energetic are less popular and less likely to place on the Billboard Hot 100.

3.1.5 Musical Key

None of the other traits seem to correlate that strongly with song rankings. How about the musical key of a song? Maybe some keys are more popular than others?

Spotify estimates the overall key of a track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

In [26]:
music_key_rank_plot = unique_songs_df.plot.scatter(x='Rank', y='Key', marker='.', c='black')
music_key_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Musical Key (Jan 2019 -- Nov 2019)\n')
music_key_rank_plot.set_xlabel('Rank')
music_key_rank_plot.set_ylabel('Musical Key')
plt.show()
In [27]:
sns.violinplot(x=unique_songs_df['Rank'], y=unique_songs_df['Key'])
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa755d13160>
In [28]:
# Reshape the Rank and Key to be used in the linear regression
X = unique_songs_df['Rank'].values.reshape(-1,1)
y = unique_songs_df['Key'].values.reshape(-1,1)

# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)

# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Key (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Key')
plt.show()

#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Regression model intercept:
[4.52239574]
Regression model slope:
[[0.0105393]]

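Okay, so higher keys don't correlate with the popularity of a song. That's not too surprising: the key numbers are labels for pitch classes rather than quantities, so a regression line over them doesn't mean much. How about if we see how many songs are in which key? Maybe some keys are more popular than others.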

In [29]:
# Group the data by musical key
group_musc_key = unique_songs_df.groupby('Key')

# Will hold musical key to the number of songs in that key
musc_key_count_dict = {}
# Integers map to pitches using standard Pitch Class notation 
musc_key_dict = {0:'C', 1:'C♯/D♭', 2:'D', 3:'D♯/E♭', 4:'E', 5:'F', 6:'F♯/G♭', 7:'G', 8:'G♯/A♭', 9:'A', 10:'A♯/B♭', 11:'B'}

# for every musical key group, count the number of songs in
# that key, then add it to musc_key_count_dict
for name, group in group_musc_key:
    musc_key_count_dict[musc_key_dict[name]] = len(group.index)

# Create a bar graph of music keys and how often they're used in the songs.
# Pair each key name with its own count (0 if no song is in that key).
df = pd.DataFrame([(key, musc_key_count_dict.get(key, 0)) for key in musc_key_dict.values()], columns=['Music Key', 'Count'])
fig = px.bar(df, x='Music Key', y='Count', labels={'x':'Music Key', 'y':'Count'})
fig.show()

Wow! So there is definitely a most popular and least popular key among the Hot 100 songs. The most popular key is C♯/D♭ and the least popular one is D♯/E♭.

Note: Some keys have two names because the same pitch can be written two ways (for example, C♯ and D♭ are enharmonic equivalents). Which name is used depends on the musical context, such as the key signature.
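For comparison, here is a minimal sketch of the same kind of count using `value_counts` and `reindex`, so that every one of the 12 pitch classes shows up in the result even when no song uses it. The `keys` Series holds made-up values for illustration only; in the notebook, the real values would come from the 'Key' column.

```python
import pandas as pd

KEY_NAMES = ['C', 'C♯/D♭', 'D', 'D♯/E♭', 'E', 'F',
             'F♯/G♭', 'G', 'G♯/A♭', 'A', 'A♯/B♭', 'B']

# Toy key column for illustration only (not the Hot 100 data).
keys = pd.Series([1, 1, 0, 11, 1, 6])

# Count songs per key, then make sure all 12 pitch classes appear,
# filling in 0 for keys that no song uses.
key_counts = keys.value_counts().reindex(range(12), fill_value=0)

# Swap the integer index for readable key names (same order as 0..11).
key_counts.index = KEY_NAMES
print(key_counts)
```

Keeping the zero-count keys makes the bar chart's x-axis the same from one dataset to the next, which makes the charts easier to compare.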

3.1.6 Song Trait Conclusions

Since there's no strong correlation linking any of the song traits we looked at with song ranking, we can't draw many conclusions about the relationship between the traits and Hot 100 rankings, and we don't have a basis for predictive analysis. We could only find some commonalities, like the most popular musical key and popular tempos. So, we'll look into other means of analysis.

3.2 Trends in the Top 50 Songs

We did not find strong correlations between rank and song traits, so let's analyze the 50 songs that spent the most time on the Hot 100 and see if we can find any trends in their song traits.

3.2.1 Tempo

In [30]:
top_songs_df = unique_songs_df.sort_values('Count', ascending=False).head(50)
top_songs_df
Out[30]:
Unnamed: 0 Song Artist Tempo Valence Loudness Energy Key Rank Count
1289 1289 Wow. Post Malone 99.960 0.3880 -7.359 0.539 11 2 44.0
195 195 Sunflower (Spider-Man: Into The Spider-Verse) Post Malone & Swae Lee 89.911 0.9130 -5.574 0.479 2 1 44.0
1970 1970 Dancing With A Stranger Sam Smith & Normani 102.998 0.3470 -7.513 0.520 8 7 41.0
96 96 Without Me Halsey 136.041 0.5330 -7.050 0.488 6 1 41.0
2359 2359 Talk Khalid 135.984 0.3460 -8.575 0.400 0 3 37.0
1198 1198 Going Bad Meek Mill Featuring Drake 86.003 0.5440 -6.365 0.496 4 10 34.0
2070 2070 Sweet But Psycho Ava Max 133.002 0.6280 -4.724 0.704 1 10 34.0
593 593 Happier Marshmello & Bastille 100.015 0.6710 -2.749 0.792 5 2 34.0
2159 2159 Old Town Road Lil Nas X Featuring Billy Ray Cyrus 136.041 0.6390 -5.560 0.619 6 1 34.0
1491 1491 Better Khalid 97.949 0.1120 -10.278 0.552 0 8 34.0
990 990 Sucker Jonas Brothers 137.958 0.9520 -5.065 0.734 1 1 34.0
393 393 7 Rings Ariana Grande 140.048 0.3270 -10.732 0.317 1 1 33.0
892 892 Shallow Lady Gaga & Bradley Cooper 95.799 0.3230 -6.362 0.385 7 1 33.0
296 296 Sicko Mode Travis Scott 155.008 0.4460 -3.714 0.730 8 3 32.0
297 297 High Hopes Panic! At The Disco 82.014 0.6810 -2.729 0.904 5 4 32.0
3252 3252 Bad Guy Billie Eilish 135.128 0.5620 -10.965 0.425 7 1 30.0
2563 2563 Suge DaBaby 75.445 0.8440 -6.482 0.662 2 7 30.0
203 203 Eastside benny blanco, Halsey & Khalid 89.391 0.3190 -7.648 0.680 6 9 29.0
1894 1894 Look Back At It A Boogie Wit da Hoodie 96.057 0.5360 -5.075 0.587 3 27 29.0
1955 1955 Middle Child PnB Rock & XXXTENTACION 151.895 0.3980 -6.400 0.567 0 91 28.0
496 496 Middle Child J. Cole 123.984 0.4630 -11.713 0.364 8 4 28.0
2175 2175 Whiskey Glasses Morgan Wallen 149.959 0.7070 -4.580 0.680 6 17 27.0
2367 2367 Pop Out Polo G Featuring Lil Tjay 168.112 0.2610 -7.119 0.639 1 11 27.0
1707 1707 Envy Me Calboy 149.042 0.5840 -7.664 0.488 1 31 26.0
3209 3209 Talk You Out Of It Florida Georgia Line 119.964 0.5580 -4.791 0.708 4 57 26.0
1504 1504 Beautiful Crazy Luke Combs 103.313 0.3820 -7.431 0.402 11 21 25.0
2180 2180 Con Calma Daddy Yankee & Katy Perry Featuring Snow 93.989 0.6560 -2.652 0.860 8 22 25.0
119 119 Speechless Dan + Shay 135.929 0.3860 -5.968 0.438 1 24 25.0
1699 1699 Pure Water Mustard & Migos 202.015 0.1370 -5.545 0.559 0 23 25.0
3200 3200 Worth It YK Osiris 123.936 0.4220 -4.082 0.528 5 48 25.0
3450 3450 Truth Hurts Lizzo 158.087 0.4120 -3.046 0.624 4 1 25.0
1965 1965 I Don't Care Ed Sheeran & Justin Bieber 101.956 0.8420 -5.041 0.675 6 2 24.0
4232 4232 Someone You Loved Lewis Capaldi 109.891 0.4460 -5.679 0.405 1 1 24.0
1414 1414 When The Party's Over Billie Eilish 82.642 0.1980 -14.084 0.111 4 29 23.0
603 603 A Lot 21 Savage 145.972 0.2740 -7.643 0.636 1 12 23.0
1869 1869 If I Can't Have You Shawn Mendes 123.911 0.8640 -4.198 0.809 2 2 23.0
2673 2673 God's Country Blake Shelton 144.586 0.4340 -10.380 0.325 2 17 23.0
3580 3580 Knockin' Boots Luke Bryan 131.983 0.6340 -3.728 0.682 2 31 22.0
300 300 Girls Like You Maroon 5 Featuring Cardi B 124.959 0.4480 -6.825 0.541 0 7 22.0
2205 2205 Close Friends Lil Baby 158.896 0.6980 -4.669 0.578 10 47 22.0
2869 2869 Hey Look Ma, I Made It Panic! At The Disco 107.936 0.5800 -3.337 0.833 5 16 22.0
2997 2997 GIRL Maren Morris 143.599 0.4470 -3.950 0.793 6 44 21.0
2382 2382 Act Up City Girls 97.075 0.3130 -4.713 0.638 8 26 21.0
1211 1211 Be Alright Dean Lewis 126.684 0.4430 -6.319 0.586 11 23 21.0
3173 3173 Beer Never Broke My Heart Luke Combs 77.000 0.6400 -4.483 0.863 1 21 21.0
3959 3959 Trampoline SHAED 126.803 0.4980 -5.782 0.459 7 18 21.0
0 0 Thank U, Next Ariana Grande 106.966 0.4120 -5.634 0.653 1 1 21.0
103 103 Drip Too Hard Lil Baby & Gunna 112.511 0.3890 -6.903 0.662 1 8 21.0
3847 3847 Ran$om Lil Tecca 105.014 0.0774 -7.844 0.813 11 4 21.0
1333 1333 Put A Date On It Yo Gotti Featuring Lil Baby 129.930 0.5790 -5.696 0.659 4 46 20.0
In [31]:
tempo_rank_plot = top_songs_df.plot.scatter(x='Rank', y='Tempo', marker='.', c='black')
tempo_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs Tempo (Top 50)\n')
tempo_rank_plot.set_xlabel('Rank')
tempo_rank_plot.set_ylabel('Tempo (BPM)')
plt.show()

Once again, the scatterplot doesn't really say much.

In [32]:
sns.set(rc={'figure.figsize':(25,10)})
sns.violinplot(x=top_songs_df['Rank'], y=top_songs_df['Tempo'])
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa756b2f898>
In [33]:
# Reshape the Rank and Tempo to be used in the linear regression
X = top_songs_df['Rank'].values.reshape(-1,1)
y = top_songs_df['Tempo'].values.reshape(-1,1)

# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)

# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Tempo (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Tempo (BPM)')
plt.show()

#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Regression model intercept:
[115.96158287]
Regression model slope:
[[0.32986549]]

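The regression slope of about 0.33 BPM per rank position is steeper than what we saw with our entire dataset. A slope on its own doesn't tell us how strong the relationship is, though, and judging by how widely the points scatter around the line, it still isn't very strong.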
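One way to measure the strength of a linear relationship, independent of the units on each axis, is Pearson's correlation coefficient r. The sketch below uses made-up numbers (not our Hot 100 data) and a hypothetical helper, `slope_and_r`, to show that rescaling the y-axis changes the slope but leaves r untouched.

```python
import numpy as np

# Made-up illustrative data, NOT taken from the Hot 100 dataset.
rank = np.array([1, 5, 10, 20, 40, 60, 80, 100], dtype=float)
tempo = np.array([120, 95, 140, 110, 150, 100, 135, 125], dtype=float)


def slope_and_r(x, y):
    """Return the least-squares slope of y on x and Pearson's r."""
    slope = np.polyfit(x, y, 1)[0]   # degree-1 fit returns [slope, intercept]
    r = np.corrcoef(x, y)[0, 1]      # off-diagonal entry is r between x and y
    return slope, r


slope_bpm, r_bpm = slope_and_r(rank, tempo)

# Expressing the same tempos in beats per second divides the slope by 60
# but does not change r.
slope_bps, r_bps = slope_and_r(rank, tempo / 60)

print(f'slope (BPM per rank): {slope_bpm:.4f}, r = {r_bpm:.4f}')
print(f'slope (BPS per rank): {slope_bps:.4f}, r = {r_bps:.4f}')
```

Because r always falls between -1 and 1, it can be compared across traits like tempo, loudness, and valence, whose slopes are measured in different units.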

In [34]:
# Group the data by tempo
group_tempo = top_songs_df.groupby('Tempo')

# Will hold tempo to the number of songs with that tempo
tempo_count_dict = {}

# for every tempo group, count the number of songs in
# that group, round the tempo float to the nearest integer,
# then add it to tempo_count_dict
for name, group in group_tempo:
    if int(round(name)) in tempo_count_dict: 
        tempo_count_dict[int(round(name))] = tempo_count_dict[int(round(name))] + len(group.index)
    else: 
        tempo_count_dict[int(round(name))] = len(group.index)

tempo_count_dict

# Create a bar graph of tempos and how often they're used in the songs
df = pd.DataFrame(list(zip(tempo_count_dict.keys(), tempo_count_dict.values())), columns =['Tempo (BPM)', 'Count'])
fig = px.bar(df, x='Tempo (BPM)', y='Count', labels={'x':'Tempo (BPM)', 'y':'Count'})
fig.show()

Similar to the results from the entire dataset, there is a visible increase in popularity from 96-103 BPM and from 124-136 BPM. Most notable are the 4 songs out of the top 50 with a BPM of 136, which account for 8% of the most popular songs. There are also only 2 songs with a BPM higher than 159; one is at 168, and the other is much more of an outlier at 202 BPM.
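The round-then-count step used for these bar graphs can also be written without an explicit loop. This is only a sketch of an alternative, using a handful of tempos copied from the table above rather than the full top 50.

```python
import pandas as pd

# A few tempos copied from the top-50 table, for illustration only.
tempos = pd.Series([99.960, 136.041, 135.984, 100.015, 89.911])

# Round to the nearest whole BPM, then count how many songs share each value.
tempo_counts = (
    tempos.round()
          .astype(int)
          .value_counts()
          .sort_index()
)
print(tempo_counts)
```

The same pattern works for valence, loudness, and energy by calling `round(1)` instead of `round()` and dropping the integer conversion.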

3.2.2 Valence

In [35]:
valence_rank_plot = top_songs_df.plot.scatter(x='Rank', y='Valence', marker='.', c='black')
valence_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Valence (Top 50)\n')
valence_rank_plot.set_xlabel('Rank')
valence_rank_plot.set_ylabel('Valence')
plt.show()
In [36]:
sns.violinplot(x=top_songs_df['Rank'], y=top_songs_df['Valence'])
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa754fc81d0>
In [37]:
# Reshape the Rank and Valence to be used in the linear regression
X = top_songs_df['Rank'].values.reshape(-1,1)
y = top_songs_df['Valence'].values.reshape(-1,1)

# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)

# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Valence (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Valence')
plt.show()

#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Regression model intercept:
[0.50650336]
Regression model slope:
[[-0.00078578]]

Again, the relationship between song ranking and valence is very weak.

In [38]:
# Group the data by valence
group_valence = top_songs_df.groupby('Valence')

# Will hold valence to the number of songs with that valence
valence_count_dict = {}

# for every valence group, count the number of songs in
# that group, round the valence float to the nearest tenth,
# then add it to valence_count_dict
for name, group in group_valence:
    if round(name, 1) in valence_count_dict: 
        valence_count_dict[round(name, 1)] = valence_count_dict[round(name, 1)] + len(group.index)
    else: 
        valence_count_dict[round(name, 1)] = len(group.index)

# Create a bar graph of valences and how often they're used in the songs
df = pd.DataFrame(list(zip(valence_count_dict.keys(), valence_count_dict.values())), columns =['Valence', 'Count']) 
fig = px.bar(df, x='Valence', y='Count', labels={'x':'Valence', 'y':'Count'})
fig.show()

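Here, we can see a slight trend towards mid to lower valences. Since Spotify's valence runs from 0.0 (more negative-sounding) to 1.0 (more positive-sounding), this suggests that songs with a neutral or slightly subdued mood stay on the chart the longest.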

3.2.3 Loudness

In [39]:
loudness_rank_plot = top_songs_df.plot.scatter(x='Rank', y='Loudness', marker='.', c='black')
loudness_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Loudness (Jan 2019 -- Nov 2019)\n')
loudness_rank_plot.set_xlabel('Rank')
loudness_rank_plot.set_ylabel('Loudness (dB)')
plt.show()
In [40]:
sns.violinplot(x=top_songs_df['Rank'], y=top_songs_df['Loudness'])
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa754edd400>
In [41]:
# Reshape the Rank and Loudness to be used in the linear regression
X = top_songs_df['Rank'].values.reshape(-1,1)
y = top_songs_df['Loudness'].values.reshape(-1,1)

# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)

# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Loudness (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Loudness')
plt.show()

#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Regression model intercept:
[-6.53859314]
Regression model slope:
[[0.01804932]]

With a slope this close to zero, the regression model again shows no meaningful relationship between rank and loudness.

In [42]:
# Group the data by loudness
group_loudness = top_songs_df.groupby('Loudness')

# Will hold loudness to the number of songs with that loudness
loudness_count_dict = {}

# for every loudness group, count the number of songs in
# that group, round the loudness float to the nearest tenth,
# then add it to loudness_count_dict
for name, group in group_loudness:
    if round(name, 1) in loudness_count_dict: 
        loudness_count_dict[round(name, 1)] = loudness_count_dict[round(name, 1)] + len(group.index)
    else: 
        loudness_count_dict[round(name, 1)] = len(group.index)

# Create a bar graph of loudness and how often they're used in the songs
df = pd.DataFrame(list(zip(loudness_count_dict.keys(), loudness_count_dict.values())), columns =['Loudness', 'Count']) 
fig = px.bar(df, x='Loudness', y='Count', labels={'Loudness':'Loudness (dB)'})
fig.show()

Of the top 50 songs, the majority have a loudness above -8 dB. This is still similar to Spotify's distribution.

3.2.4 Energy

In [43]:
energy_rank_plot = top_songs_df.plot.scatter(x='Rank', y='Energy', marker='.', c='black')
energy_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Energy (Top 50)\n')
energy_rank_plot.set_xlabel('Rank')
energy_rank_plot.set_ylabel('Energy')
plt.show()
In [44]:
sns.violinplot(x=top_songs_df['Rank'], y=top_songs_df['Energy'])
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa754aa2278>
In [45]:
# Reshape the Rank and Energy to be used in the linear regression
X = top_songs_df['Rank'].values.reshape(-1,1)
y = top_songs_df['Energy'].values.reshape(-1,1)

# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)

# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Energy (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Energy')
plt.show()

#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Regression model intercept:
[0.58808983]
Regression model slope:
[[0.00023198]]
In [46]:
# Group the data by energy
group_energy = top_songs_df.groupby('Energy')

# Will hold energy to the number of songs with that energy
energy_count_dict = {}

# for every energy group, count the number of songs in
# that group, round the energy float to the nearest tenth,
# then add it to energy_count_dict
for name, group in group_energy:
    if round(name, 1) in energy_count_dict: 
        energy_count_dict[round(name, 1)] = energy_count_dict[round(name, 1)] + len(group.index)
    else: 
        energy_count_dict[round(name, 1)] = len(group.index)

# Create a bar graph of energy and how often they're used in the songs
df = pd.DataFrame(list(zip(energy_count_dict.keys(), energy_count_dict.values())), columns =['Energy', 'Count']) 
fig = px.bar(df, x='Energy', y='Count', labels={'x':'Energy', 'y':'Count'})
fig.show()

This looks similar to our original energy bar graph, with a peak at 0.7 and large drop at 0.8. This could mean that middle to middle-high energy songs are indeed the most popular.

3.2.5 Musical Key

In [47]:
music_key_rank_plot = top_songs_df.plot.scatter(x='Rank', y='Key', marker='.', c='black')
music_key_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Musical Key (Top 50)\n')
music_key_rank_plot.set_xlabel('Rank')
music_key_rank_plot.set_ylabel('Musical Key')
plt.show()
In [48]:
sns.violinplot(x=top_songs_df['Rank'], y=top_songs_df['Key'])
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa75fe74780>
In [49]:
# Reshape the Rank and Key to be used in the linear regression
X = top_songs_df['Rank'].values.reshape(-1,1)
y = top_songs_df['Key'].values.reshape(-1,1)

# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)

# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Key (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Key')
plt.show()

#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Regression model intercept:
[4.41487339]
Regression model slope:
[[-0.00714387]]
In [50]:
# Group the data by musical key
group_musc_key = top_songs_df.groupby('Key')

# Will hold musical key to the number of songs in that key
musc_key_count_dict = {}
# Integers map to pitches using standard Pitch Class notation 
musc_key_dict = {0:'C', 1:'C♯/D♭', 2:'D', 3:'D♯/E♭', 4:'E', 5:'F', 6:'F♯/G♭', 7:'G', 8:'G♯/A♭', 9:'A', 10:'A♯/B♭', 11:'B'}

# for every musical key group, count the number of songs in
# that key, then add it to musc_key_count_dict
for name, group in group_musc_key:
    musc_key_count_dict[musc_key_dict[name]] = len(group.index)

# Create a bar graph of music keys and how often they're used in the songs.
# Pair each key name with its own count (0 if no song is in that key).
df = pd.DataFrame([(key, musc_key_count_dict.get(key, 0)) for key in musc_key_dict.values()], columns=['Music Key', 'Count'])
fig = px.bar(df, x='Music Key', y='Count', labels={'x':'Music Key', 'y':'Count'})
fig.show()

As with the larger dataset, C♯/D♭ is the most popular key, but this time by a much larger margin. The least popular one is still D♯/E♭, and A is also notably rare.

3.2.6 Top 50 Conclusions

Among the top 50 songs, we can see that C♯/D♭ is the most popular key, while D♯/E♭ and A are the least popular. Valences were generally between 0.3 and 0.6, denoting a neutral to slightly subdued sound. Middle to middle-high energy levels, from 0.4 to 0.7, are the most popular. Most of the top 50 had a loudness from -8 to -2.7 dB. Even though most of these do not deviate significantly from Spotify's aggregate data, they reflect the songs that people listen to the most.

3.3 Grouping by Artists

So these individual audio features don't show that much correlation in the grand scheme of things. However, 100 songs is a lot; there is definitely room for diversity in the songs that enter the chart. Maybe finding the artists that consistently have top hits and analyzing their music would be helpful. We defined the most popular artists to be the ones that have the most songs in the Hot 100.

Below we found all the unique artist names in the Hot 100 charts, but we realized a lot of the artist names included featured artists in the form "Artist1 Featuring Artist2", so we decided to count those songs toward "Artist1".
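The "Artist1 Featuring Artist2" handling described above can be sketched with a plain string split. The helper name `lead_artist` is hypothetical and only meant to illustrate the idea on a couple of credits from our data.

```python
def lead_artist(credit):
    """Return the part of an artist credit before ' Featuring '.

    lead_artist is a hypothetical helper written for illustration.
    """
    return credit.split(' Featuring ')[0].strip()


print(lead_artist('Meek Mill Featuring Drake'))  # Meek Mill
print(lead_artist('Khalid'))                     # Khalid
```

Credits without "Featuring" pass through unchanged, so the helper can be applied to every artist name.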

In [51]:
unique_artists = hot_100_df.Artist.unique()

# grab all unique artists, disregard 'Featuring'
for i in range(0, len(unique_artists)):
  if "Featuring" in unique_artists[i]:
    idx = unique_artists[i].find('Featuring')
    unique_artists[i] = unique_artists[i][:idx-1]

unique_artists = list(set(unique_artists))

After getting the unique artists, we need to count the number of songs they have in the Hot 100 charts to determine the most popular artists.

In [52]:
# count how many songs each artist has had on the Hot 100 this year
artist_songs_in_chart = {}
for artist in unique_artists:
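  # regex=False matches the artist name as literal text (names may contain characters like '+' or '$')
  song_count = len(unique_songs_df[unique_songs_df['Artist'].str.contains(artist, regex=False)])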
  artist_songs_in_chart[artist] = song_count
In [53]:
# sort list of artists in descending order for # of songs in Hot 100 over the year
artist_songs_in_chart_sorted = sorted(artist_songs_in_chart.items(), key=lambda kv: kv[1], reverse=True)
In [54]:
# select the top 5 artists to look at
selected_artists = []
for i in range(0, 5):
  selected_artists.append(artist_songs_in_chart_sorted[i][0])
  print(artist_songs_in_chart_sorted[i][0] + ": " + str(artist_songs_in_chart_sorted[i][1]) + " songs")
DaBaby: 20 songs
Taylor Swift: 18 songs
Post Malone: 18 songs
Ariana Grande: 17 songs
Future: 15 songs

Sorting by number of unique songs in the charts gives us the above 5 top artists!
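One thing to keep in mind when counting with `Series.str.contains`: by default it treats its pattern as a regular expression, and it matches substrings, so collaborations are counted as well. The toy credits below are for illustration only and show both effects.

```python
import pandas as pd

# Toy artist credits for illustration only.
credits = pd.Series(['Dan + Shay', 'Post Malone', 'Post Malone & Swae Lee'])

# regex=False treats the name as literal text.
literal_hits = credits.str.contains('Dan + Shay', regex=False).sum()

# With the default regex=True, ' +' means "one or more spaces",
# so the literal name no longer matches itself.
regex_hits = credits.str.contains('Dan + Shay').sum()

# Substring matching also picks up collaborations.
post_malone_hits = credits.str.contains('Post Malone', regex=False).sum()

print(literal_hits, regex_hits, post_malone_hits)  # 1 0 2
```

Whether collaborations should count toward an artist's total is a judgment call; the point here is only that substring matching includes them.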

So now we can analyze the features of their songs. Below we plotted each of the artists' songs' ranks in relation to the five different audio features and added a regression line to see general trend.

In [55]:
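# X (and y below) use every unique song, so the red regression line shows
# the overall trend and is the same in each artist's figure
X = unique_songs_df['Rank'].values.reshape(-1,1)
audio_features = ['Tempo', 'Valence', 'Loudness', 'Energy', 'Key']
for artist in selected_artists:
  # regex=False matches the artist name as literal text
  artist_songs_df = unique_songs_df[unique_songs_df['Artist'].str.contains(artist, regex=False)]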
  fig = plt.figure()
  fig.subplots_adjust(hspace=.4, wspace=.4)
  fig.suptitle(artist + ": Billboard Hot 100 Rank of a Song vs. Audio Features (Jan 2019 -- Nov 2019)\n")
  for i in range(0, len(audio_features)):
    af = audio_features[i]
    y = unique_songs_df[af].values.reshape(-1,1)

    # Fit the regression on all unique songs (X, y), not only this artist's songs
    regressor = LinearRegression()
    regressor.fit(X, y)
    ax = fig.add_subplot(2, 3, i+1)
    ax.scatter(artist_songs_df['Rank'], artist_songs_df[af], color = 'black')
    ax.plot(X, regressor.predict(X), color = 'red')
    ax.set_xlabel('Rank')
    ax.set_ylabel(af)
  plt.show()

From the graphs, we can see that there is a very slight trend with tempo and valence. More specifically, for these top 5 artists of 2019, their slower-tempo songs tended to rank higher, and their songs with higher valence also ranked higher. All things considered, however, the tempo of these artists' songs is not that slow; it is just that the relatively slower ones tend to rank higher. Valence also makes sense in this context, since higher valence means more positive songs, which suggests that people tend to like the more positive songs of these artists. There does not seem to be much correlation between rank and the other audio features.
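To put a number on each artist's own trend, one option is to fit a separate line per artist. The sketch below is only an illustration under assumptions: the `songs` DataFrame is toy data with the same column names as ours, and `per_artist_slope` is a hypothetical helper, not something used elsewhere in this tutorial.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the real data lives in unique_songs_df.
songs = pd.DataFrame({
    'Artist': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Rank':   [1, 11, 21, 5, 15, 25],
    'Tempo':  [90.0, 100.0, 110.0, 140.0, 130.0, 120.0],
})


def per_artist_slope(df, feature):
    """Fit feature ~ Rank separately for each artist and return the slopes."""
    slopes = {}
    for artist, group in df.groupby('Artist'):
        # A degree-1 polyfit returns [slope, intercept]
        slope, _intercept = np.polyfit(group['Rank'], group[feature], 1)
        slopes[artist] = slope
    return slopes


slopes = per_artist_slope(songs, 'Tempo')
print(slopes)
```

With only 15 to 20 songs per artist, any such slope would be quite noisy, so it is best read as a rough indication rather than a firm result.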

Since the "key" feature has fixed values, why don't we take a look at what keys these artists release most of their hits in?

In [56]:
for artist in selected_artists:
  artist_songs_df = unique_songs_df[unique_songs_df['Artist'].str.contains(artist, regex=False)]
  af = 'Key'
  groupby_key = artist_songs_df.groupby(af)
  names = []
  values = []
  for name, group in groupby_key:
    names.append(name)
    values.append(len(group))

  plt.bar(names, values, align='center')
  plt.title(artist + ": Billboard Hot 100 Key Usage in Songs (Jan 2019 -- Nov 2019)\n")
  plt.xlabel(af)
  plt.ylabel('Number of Occurrences')
  plt.show()

Wow! For four out of the five artists, keys 0 and 1 (C and C♯/D♭) were used the most. This could be interpreted as these being the keys that the public is most receptive to. However, Billie Eilish clearly has had success with most of her Hot 100 hits being in key 4 (E).

So how does this relate to all of the other songs in the Hot 100? As mentioned earlier in this tutorial, the most popular key among all the songs is C♯/D♭, which matches what we found here for the most popular artists. D♯/E♭ (key 3) was found to be the least popular key, and that is also reflected here: most artists did not release or chart a song in key 3 or, if they did, there were very few of them.

4. Insights & Conclusions

From our data exploration and analysis, we have unfortunately found out that there is not much correlation between the audio features of a song and how it ranks on the Billboard Hot 100 charts. The graphs of each audio feature vs. rank show a pretty random and even distribution of values, so we cannot conclude anything from the data. However, when we looked at general count of certain audio feature values, we found that there were some values that took the lead.

For musical key, the data showed that C♯/D♭ is by far the most popular key for songs that made it into the Hot 100 charts. In addition, D♯/E♭ turned out to be the least used key among the charting songs. There are several possible explanations:

  • the public is most receptive to songs in C♯/D♭
  • people are less likely to like songs in D♯/E♭
  • it's easiest to make songs in C♯/D♭
  • it's hardest to make songs in D♯/E♭

Without a lot of musical knowledge, it is hard to make conclusions about the last two bullet points. But, we did find a pie chart of all the keys used in Spotify songs:

[Figure: pie chart of the keys used in Spotify songs]

http://thekeyofone.com/blog/the-most-and-least-used-keys-of-music-in-spotify/

This shows us that the most used key is in fact G. C♯ is not even close to being the most used key in all of Spotify. Granted, this chart is from 2016, so there could be changing musical trends, but we believe this is a valid argument that it is not necessarily the easiest to make songs in C♯/D♭.

To account for data skewed by less popular songs and artists, which tell us less about general taste, we looked at both the songs that stayed on the Hot 100 chart the longest this year and the songs of the most popular artists.

Looking at the top 50 songs that stayed in the chart the longest, we see that C♯/D♭ is also the most used key among them and D♯/E♭ and A are least used. We also see that middle to middle-high energy levels, from 0.4 to 0.7, are the most popular. Most of the top 50 had a loudness from -8 to -2.7 dB. There is a preference for songs that are neutral to slightly subdued in mood.

When analyzing the songs of the most popular artists, we found the trends to be the same as the trends in all of the Hot 100. The musical key distribution reflected the same trends of popular C♯/D♭ and least popular D♯/E♭.

5. Future Exploration

For future analysis of songs, we would definitely be interested in exploring these same trends over the course of the whole decade--Jan 2010 to December 2019--to see if anything has actually changed musically over time. Additionally, it would be interesting to see if there are any trends in the musical qualities of songs depending on which season they were released in. For example, are winter song releases less energetic than summer releases?

We would also like to apply some kind of machine learning model to a larger data set to see if we can predict a song's Billboard Hot 100 placement based on its sound qualities or its artists' prior performance on the charts. If a successful model could be made, we would also like to see if it could accurately predict foreign artists' success on the Hot 100, e.g. "Despacito" by Luis Fonsi and Daddy Yankee or "Boy With Luv" by BTS ft. Halsey.
